Scope

This document describes the filtering of poor-quality cells and genes, a standard first quality-control (QC) step in scRNAseq data pre-processing. Commonly used QC metrics include the number of unique genes detected in each cell and the percentage of reads that map to the mitochondrial genome.

Primary data

A total of six samples were library-prepared and sequenced by NovoGene Co, Ltd. Libraries were prepared using the 10x Single Cell 3' v3 kit. The corresponding biological samples were delivered to IJC at the end of October 2023, and the sequencing data were received at the end of February 2024. Samples are equally divided between two conditions, both consisting of cells from mouse embryoid bodies (mEBs, 144 h):

  • wild-type (WT) condition (3 samples), and
  • after activation of 7* specific genes (7g, labelled g7 in the sample IDs below; 3 samples), identified by CRISPRa in an earlier project phase.

*Genes related to HSC development fate.

Methods

The following criteria are applied:

  • Cells with mitochondrial content higher than 10% are discarded.
  • Cells with fewer than 1,000 counts are discarded.
  • Cells with more than 7,000 or fewer than 300 detected genes are discarded.
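
As a minimal sketch, the three cell-level criteria above can be written as a single predicate. This is plain Python for illustration only, not the Seurat code used for the analysis. The names `CellQC`, `passes_qc`, `n_counts`, `n_genes` and `pct_mito` are hypothetical, and the toy cells are invented values.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_counts: int    # total counts in the cell
    n_genes: int     # number of genes detected in the cell
    pct_mito: float  # percentage of counts from mitochondrial genes


# Thresholds taken from the criteria listed above.
MAX_PCT_MITO = 10.0
MIN_COUNTS = 1000
MIN_GENES = 300
MAX_GENES = 7000


def passes_qc(cell: CellQC) -> bool:
    """Return True if the cell survives all three cell-level filters."""
    if cell.pct_mito > MAX_PCT_MITO:  # mitochondrial content higher than 10%
        return False
    if cell.n_counts < MIN_COUNTS:    # fewer than 1,000 counts
        return False
    if cell.n_genes > MAX_GENES or cell.n_genes < MIN_GENES:
        return False
    return True


# Toy cells (invented values, not taken from the dataset).
cells = [
    CellQC("cell_a", n_counts=5200, n_genes=2100, pct_mito=3.5),
    CellQC("cell_b", n_counts=800, n_genes=450, pct_mito=2.0),     # too few counts
    CellQC("cell_c", n_counts=9000, n_genes=3000, pct_mito=14.2),  # high mito content
    CellQC("cell_d", n_counts=40000, n_genes=7500, pct_mito=1.0),  # too many genes
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

In this sketch a cell sitting exactly on a threshold (10% mitochondrial content, 1,000 counts, 300 or 7,000 genes) is kept, which follows the wording of the list literally.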

Ribosomal genes are removed from the dataset. Additionally, genes expressed in 10 or fewer cells (across all samples) are discarded.

QC is performed with the Seurat R package (v5.0.0) (Hao et al. 2023).

REMARK: Cells classified as doublets/multiplets per sample in a previous analysis are also discarded in this analysis.

Results

All scRNAseq samples were merged into a single Seurat object:

## An object of class Seurat 
## 32285 features across 47177 samples within 1 assay 
## Active assay: RNA (32285 features, 0 variable features)
##  6 layers present: counts.WT_s1, counts.WT_s2, counts.WT_s3, counts.g7_s1, counts.g7_s2, counts.g7_s3

The complete dataset, without any filtering applied, comprises 32,285 features (genes) across 47,177 cells (reported as "samples" in the Seurat output above), distributed over 6 samples from two conditions.

Doublets removal

Before exploring the quality of the cells, those already classified as doublets are removed. Doublet identification was conducted independently for each sample.

Number of cells per sample: initially and after doublet removal

Sample.ID   Initial.Cells   Cells.woDBL
g7_s1       8785            8032
g7_s2       7601            6997
g7_s3       7109            6535
WT_s1       7757            7190
WT_s2       7214            6503
WT_s3       8711            8003
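
The doublet fraction per sample follows directly from the two columns of this table. The short Python sketch below only redoes that arithmetic on the reported numbers; the variable names are illustrative.

```python
# (initial cells, cells after doublet removal), copied from the table above.
cells = {
    "g7_s1": (8785, 8032),
    "g7_s2": (7601, 6997),
    "g7_s3": (7109, 6535),
    "WT_s1": (7757, 7190),
    "WT_s2": (7214, 6503),
    "WT_s3": (8711, 8003),
}

for sample, (initial, kept) in cells.items():
    removed = initial - kept
    print(f"{sample}: {removed} doublets ({100 * removed / initial:.1f}% of barcodes)")

# Totals over all six samples.
total_initial = sum(initial for initial, _ in cells.values())
total_kept = sum(kept for _, kept in cells.values())
print(total_initial, total_kept)  # 47177 43260
```

The initial total of 47,177 agrees with the number of cells reported for the merged Seurat object.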

QC metrics

Exploration

Before applying any QC filter, a general exploration is conducted to assess overall cell quality. The following metrics are computed per barcode (cell):

  • Number of genes
  • Number of counts
  • Mitochondrial content
  • Ribosomal content
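
To make the four metrics concrete, here is a minimal Python sketch that computes them for a single cell from a gene-to-count mapping. It is an illustration, not the Seurat code used in the analysis. The function name `qc_metrics` is hypothetical, and the gene-name prefixes (`mt-` for mitochondrial genes, `Rps`/`Rpl` for ribosomal genes) are an assumption based on common mouse naming conventions; the report does not state which patterns were used.

```python
def qc_metrics(counts: dict[str, int]) -> dict[str, float]:
    """Compute per-cell QC metrics from a {gene: count} mapping."""
    total = sum(counts.values())
    # A gene is "detected" if it has at least one count in this cell.
    n_genes = sum(1 for c in counts.values() if c > 0)
    # Assumed prefixes: "mt-" (mitochondrial), "Rps"/"Rpl" (ribosomal).
    mito = sum(c for g, c in counts.items() if g.startswith("mt-"))
    ribo = sum(c for g, c in counts.items() if g.startswith(("Rps", "Rpl")))
    return {
        "n_genes": n_genes,
        "n_counts": total,
        "pct_mito": 100 * mito / total if total else 0.0,
        "pct_ribo": 100 * ribo / total if total else 0.0,
    }


# Toy cell with invented counts.
cell = {"Sox17": 4, "Lypla1": 6, "mt-Co1": 5, "mt-Nd1": 5,
        "Rpl13": 20, "Rps2": 10, "Xkr4": 0}
m = qc_metrics(cell)
print(m)
```

For this toy cell, 6 of the 7 genes are detected, the total is 50 counts, and the mitochondrial and ribosomal fractions are 20% and 60% respectively.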

In addition, joint plots of pairs of metrics are useful for checking relationships among variables.

All samples behave similarly, regardless of condition.

Removal criteria with MADs

Possible cutoffs based on four MADs (median absolute deviations) are explored:
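
A MAD-based cutoff can be sketched as follows. This Python example assumes that "four MADs" means a distance of four median absolute deviations on either side of the median, that the MAD is unscaled, and that the metric is used on its raw scale; none of these details is stated in the report, and the function name `mad_bounds` and the input values are hypothetical.

```python
from statistics import median


def mad_bounds(values, n_mads=4.0):
    """Return (lower, upper) cutoffs at median -/+ n_mads * MAD."""
    med = median(values)
    # Unscaled median absolute deviation from the median.
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Toy mitochondrial percentages (invented values).
pct_mito = [2.0, 3.0, 3.5, 4.0, 5.0, 30.0]
lower, upper = mad_bounds(pct_mito, n_mads=4.0)
flagged = [v for v in pct_mito if v < lower or v > upper]
print(lower, upper, flagged)
```

Here the median is 3.75 and the MAD is 1.0, so the cutoffs are -0.25 and 7.75 and only the value 30.0 is flagged. Data-driven cutoffs of this kind are typically compared against fixed thresholds such as those listed in Methods.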

Subset low quality cells

Cells are subset according to the QC criteria listed in Methods. The final number of cells per sample is shown in the following table.

Number of cells per sample: initially, after doublet removal and after final QC criteria

Sample.ID   Initial.Cells   Cells.woDBL   Cells.FINAL
g7_s1       8785            8032          6280
g7_s2       7601            6997          5625
g7_s3       7109            6535          5083
WT_s1       7757            7190          5955
WT_s2       7214            6503          5503
WT_s3       8711            8003          6504

Of note, most discarded cells were removed because of their mitochondrial percentage. After filtering, the QC plots are visualized again:

Discard non-expressed and ribosomal genes

Finally, genes expressed in fewer than 10 cells (considering all samples together) are discarded. To this end, the number of cells expressing each gene is computed per sample:

Example of number of cells expressing a particular gene (in rows) per sample (columns)

Gene      counts.WT_s1  counts.WT_s2  counts.WT_s3  counts.g7_s1  counts.g7_s2  counts.g7_s3
Xkr4      931           806           869           1165          1086          1005
Gm1992    34            39            44            67            67            69
Gm19938   115           119           121           163           183           190
Gm37381   1             2             1             2             6             0
Rp1       31            19            23            19            17            20
Sox17     453           424           447           372           310           370
Gm37587   5             3             9             4             3             6
Gm37323   1             0             1             1             1             0
Mrpl15    5039          4554          5207          5038          4761          4416
Lypla1    2537          2379          2362          2513          2474          2521

As a reminder, 32,285 genes were initially included in the expression matrix (all samples). Of those, 22,591 are kept for downstream analysis after discarding genes that are not expressed or only residually expressed.
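
The gene-level filter can be illustrated on the ten example genes from the table above. The Python sketch below sums the per-sample numbers of expressing cells (samples contain disjoint sets of cells, so the sum is the total across the dataset) and applies a minimum of 10 cells. Whether a gene expressed in exactly 10 cells is kept is an assumption here (`MIN_CELLS` is a hypothetical name), and this is not the Seurat code used in the analysis.

```python
# Cells expressing each gene per sample, copied from the example table above
# (column order: WT_s1, WT_s2, WT_s3, g7_s1, g7_s2, g7_s3).
cells_expressing = {
    "Xkr4": [931, 806, 869, 1165, 1086, 1005],
    "Gm1992": [34, 39, 44, 67, 67, 69],
    "Gm19938": [115, 119, 121, 163, 183, 190],
    "Gm37381": [1, 2, 1, 2, 6, 0],
    "Rp1": [31, 19, 23, 19, 17, 20],
    "Sox17": [453, 424, 447, 372, 310, 370],
    "Gm37587": [5, 3, 9, 4, 3, 6],
    "Gm37323": [1, 0, 1, 1, 1, 0],
    "Mrpl15": [5039, 4554, 5207, 5038, 4761, 4416],
    "Lypla1": [2537, 2379, 2362, 2513, 2474, 2521],
}

MIN_CELLS = 10  # assumption: a gene needs at least 10 expressing cells overall

# Total number of expressing cells per gene, across all samples.
totals = {gene: sum(per_sample) for gene, per_sample in cells_expressing.items()}
kept = [gene for gene, n in totals.items() if n >= MIN_CELLS]
dropped = [gene for gene, n in totals.items() if n < MIN_CELLS]
print(dropped)
```

Among these ten genes only Gm37323 (4 expressing cells in total) falls below the threshold. Gm37381 is expressed in at most 6 cells in any single sample but in 12 cells overall, so it is kept, which shows why the count is taken over all samples together.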

Finally, ribosomal genes are removed from this dataset; 99 ribosomal genes are present.
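
The report does not state how ribosomal genes were identified. A common convention for mouse gene symbols is to match the prefixes `Rps` and `Rpl`; the Python sketch below uses that pattern as an assumption, applied to a few gene names from the example table plus two ribosomal protein genes added for illustration.

```python
import re

# Assumed naming convention for mouse ribosomal protein genes (Rps*, Rpl*).
RIBO_PATTERN = re.compile(r"^Rp[sl]")

# Rpl13 and Rps2 are added for illustration; the rest appear in the table above.
genes = ["Xkr4", "Rp1", "Sox17", "Mrpl15", "Lypla1", "Rpl13", "Rps2"]

ribosomal = [g for g in genes if RIBO_PATTERN.match(g)]
kept = [g for g in genes if not RIBO_PATTERN.match(g)]
print(ribosomal)
print(kept)
```

Under this pattern Rp1 and Mrpl15 are not treated as ribosomal genes. Whether the 99 genes removed in the analysis include mitochondrial ribosomal protein genes such as Mrpl15 depends on the pattern actually used, which is not given here.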

The final dataset comprises 22,492 features (genes) across 34,950 cells, distributed over 6 samples from two conditions.
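
The reported totals can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in this report; the variable names are illustrative.

```python
# Gene bookkeeping, using the numbers reported above.
initial_genes = 32285
after_expression_filter = 22591
ribosomal_genes = 99
final_genes = after_expression_filter - ribosomal_genes

# Final cells per sample, copied from the Cells.FINAL column.
final_cells = {
    "g7_s1": 6280, "g7_s2": 5625, "g7_s3": 5083,
    "WT_s1": 5955, "WT_s2": 5503, "WT_s3": 6504,
}
total_final_cells = sum(final_cells.values())
initial_cells = 47177

print(final_genes, total_final_cells)  # 22492 34950
print(f"{initial_genes - after_expression_filter} genes removed by the expression filter")
print(f"{100 * total_final_cells / initial_cells:.1f}% of initial barcodes retained")
```

Both totals agree with the final dataset size stated above.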

Session Information

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Madrid
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Seurat_5.0.0       SeuratObject_5.0.0 sp_2.1-1           ggpubr_0.6.0      
## [5] stringr_1.5.0      dplyr_1.1.4        data.table_1.14.8  ggplot2_3.4.4     
## [9] knitr_1.44        
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     rstudioapi_0.15.0      jsonlite_1.8.8        
##   [4] magrittr_2.0.3         spatstat.utils_3.0-4   ggbeeswarm_0.7.2      
##   [7] farver_2.1.1           rmarkdown_2.24         vctrs_0.6.5           
##  [10] ROCR_1.0-11            spatstat.explore_3.2-5 rstatix_0.7.2         
##  [13] htmltools_0.5.6        broom_1.0.5            sass_0.4.7            
##  [16] sctransform_0.4.1      parallelly_1.36.0      KernSmooth_2.23-21    
##  [19] bslib_0.5.1            htmlwidgets_1.6.2      ica_1.0-3             
##  [22] plyr_1.8.8             plotly_4.10.2          zoo_1.8-12            
##  [25] cachem_1.0.8           igraph_1.5.1           mime_0.12             
##  [28] lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-1          
##  [31] R6_2.5.1               fastmap_1.1.1          fitdistrplus_1.1-11   
##  [34] future_1.33.0          shiny_1.7.5            digest_0.6.33         
##  [37] colorspace_2.1-0       patchwork_1.2.0        tensor_1.5            
##  [40] RSpectra_0.16-1        irlba_2.3.5.1          labeling_0.4.3        
##  [43] progressr_0.14.0       fansi_1.0.4            spatstat.sparse_3.0-3 
##  [46] httr_1.4.7             polyclip_1.10-4        abind_1.4-5           
##  [49] compiler_4.3.1         withr_2.5.2            backports_1.4.1       
##  [52] carData_3.0-5          fastDummies_1.7.3      ggsignif_0.6.4        
##  [55] MASS_7.3-60            tools_4.3.1            vipor_0.4.7           
##  [58] lmtest_0.9-40          beeswarm_0.4.0         httpuv_1.6.11         
##  [61] future.apply_1.11.0    goftest_1.2-3          glue_1.6.2            
##  [64] nlme_3.1-162           promises_1.2.1         grid_4.3.1            
##  [67] Rtsne_0.16             cluster_2.1.4          reshape2_1.4.4        
##  [70] generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-3   
##  [73] tidyr_1.3.0            car_3.1-2              utf8_1.2.3            
##  [76] spatstat.geom_3.2-7    RcppAnnoy_0.0.21       ggrepel_0.9.5         
##  [79] RANN_2.6.1             pillar_1.9.0           spam_2.10-0           
##  [82] RcppHNSW_0.5.0         later_1.3.1            splines_4.3.1         
##  [85] lattice_0.21-8         survival_3.5-5         deldir_1.0-9          
##  [88] tidyselect_1.2.0       miniUI_0.1.1.1         pbapply_1.7-2         
##  [91] gridExtra_2.3          scattermore_1.2        xfun_0.40             
##  [94] matrixStats_1.0.0      stringi_1.7.12         lazyeval_0.2.2        
##  [97] yaml_2.3.7             evaluate_0.21          codetools_0.2-19      
## [100] tibble_3.2.1           cli_3.6.2              uwot_0.1.16           
## [103] xtable_1.8-4           reticulate_1.34.0      munsell_0.5.0         
## [106] jquerylib_0.1.4        Rcpp_1.0.11            globals_0.16.2        
## [109] spatstat.random_3.2-1  png_0.1-8              ggrastr_1.0.2         
## [112] parallel_4.3.1         ellipsis_0.3.2         dotCall64_1.1-0       
## [115] listenv_0.9.0          viridisLite_0.4.2      scales_1.3.0          
## [118] ggridges_0.5.4         leiden_0.4.3           purrr_1.0.2           
## [121] rlang_1.1.2            cowplot_1.1.1

References

Hao, Yuhan, Tim Stuart, Madeline H Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, et al. 2023. “Dictionary Learning for Integrative, Multimodal and Scalable Single-Cell Analysis.” Nature Biotechnology. https://doi.org/10.1038/s41587-023-01767-y.